Diffusion of Lexical Change in Social Media
Computer-mediated communication is driving fundamental changes in the nature
of written language. We investigate these changes by statistical analysis of a
dataset comprising 107 million Twitter messages (authored by 2.7 million unique
user accounts). Using a latent vector autoregressive model to aggregate across
thousands of words, we identify high-level patterns in diffusion of linguistic
change over the United States. Our model is robust to unpredictable changes in
Twitter's sampling rate, and provides a probabilistic characterization of the
relationship of macro-scale linguistic influence to a set of demographic and
geographic predictors. The results of this analysis offer support for prior
arguments that focus on geographical proximity and population size. However,
demographic similarity -- especially with regard to race -- plays an even more
central role, as cities with similar racial demographics are far more likely to
share linguistic influence. Rather than moving towards a single unified
"netspeak" dialect, language evolution in computer-mediated communication
reproduces existing fault lines in spoken American English.
Comment: preprint of PLOS ONE paper from November 2014; PLoS ONE 9(11) e11311
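The latent vector autoregressive model the abstract mentions builds on plain VAR dynamics, where each city's word-usage state at time t is a linear function of all cities' states at t-1. A minimal first-order sketch with a least-squares fit is shown below; the influence matrix and noise scale are made-up illustrations, and the paper's actual model additionally handles Twitter's variable sampling rate, which this sketch does not.

```python
import numpy as np

# First-order vector autoregression: x_t = A @ x_{t-1} + noise.
# A_true is a hypothetical 2-city "influence" matrix; the fit below
# recovers it from a simulated series by ordinary least squares.
rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.2],
                   [0.1, 0.6]])
T = 500
x = np.zeros((T, 2))
for t in range(1, T):
    x[t] = A_true @ x[t - 1] + 0.01 * rng.standard_normal(2)

# Stack lagged pairs and solve X_prev @ A.T ~= X_next in least squares.
X_prev, X_next = x[:-1], x[1:]
B, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)
A_hat = B.T  # lstsq returns B = A.T
print(np.round(A_hat, 2))
```

In the paper's setting the rows and columns of the influence matrix are what carry the linguistic story: a large off-diagonal entry means one metropolitan area's usage predicts another's.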
Improving unsupervised learning with Exemplar CNNs
Most recent unsupervised learning methods explore alternative objectives, often referred to as self-supervised tasks, to train convolutional neural networks without the supervision of human annotated labels. This paper explores the generation of surrogate classes as a self-supervised alternative to learn discriminative features, and proposes a clustering algorithm to overcome one of the main limitations of this kind of approach. Our clustering technique improves the initial implementation and achieves 76.4% accuracy on the STL-10 test set, surpassing the current state-of-the-art for the STL-10 unsupervised benchmark. We also explore several issues with the unlabeled set from STL-10 that should be considered in future research using this dataset.
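The surrogate-class idea can be sketched in a few lines: each unlabeled exemplar seeds its own class, and random transformations of it become the class members, giving a labeled dataset for free. The toy "images" (tuples) and the `augment` perturbation below are stand-ins for real crops, flips, and color jitter, not the paper's pipeline.

```python
import random

def augment(img, rng):
    # Illustrative perturbation standing in for crops/flips/color jitter.
    return tuple(v + rng.uniform(-0.1, 0.1) for v in img)

def surrogate_dataset(images, copies=4, seed=0):
    """Exemplar-CNN-style dataset: one surrogate class per image."""
    rng = random.Random(seed)
    data = []
    for label, img in enumerate(images):  # each exemplar gets its own label
        for _ in range(copies):
            data.append((augment(img, rng), label))
    return data

pairs = surrogate_dataset([(0.0, 1.0), (5.0, 5.0)], copies=3)
print(len(pairs))  # 2 exemplars x 3 augmented copies each
```

A network trained to tell these surrogate classes apart must become invariant to the augmentations, which is what yields transferable features; the clustering step the abstract proposes addresses the explosion in the number of such classes.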
Self-Supervised and Semi-Supervised Polyp Segmentation using Synthetic Data
Early detection of colorectal polyps is of utmost importance for their
treatment and for colorectal cancer prevention. Computer vision techniques have
the potential to aid professionals in the diagnosis stage, where colonoscopies
are manually carried out to examine the entirety of the patient's colon. The
main challenge in medical imaging is the lack of data, and a further challenge
specific to polyp segmentation approaches is the difficulty of manually
labeling the available data: the annotation process for segmentation tasks is
very time-consuming. While most recent approaches address the data availability
challenge with sophisticated techniques to better exploit the available labeled
data, few of them explore the self-supervised or semi-supervised paradigm,
where the amount of labeling required is greatly reduced. To address both
challenges, we leverage synthetic data and propose an end-to-end model for
polyp segmentation that integrates real and synthetic data to artificially
increase the size of the datasets and aid the training when unlabeled samples
are available. Concretely, our model, Pl-CUT-Seg, transforms synthetic images
with an image-to-image translation module and combines the resulting images
with real images to train a segmentation model, where we use model predictions
as pseudo-labels to better leverage unlabeled samples. Additionally, we propose
PL-CUT-Seg+, an improved version of the model that incorporates targeted
regularization to address the domain gap between real and synthetic images. The
models are evaluated on standard benchmarks for polyp segmentation and reach
state-of-the-art results in the self- and semi-supervised setups.
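The pseudo-labeling step described above (model predictions reused as labels on unlabeled samples) commonly keeps only confident pixels. A minimal per-pixel sketch follows; the small probability map is a stand-in for real network output, and the confidence threshold is an assumption, not a value from the paper.

```python
import numpy as np

def pseudo_label(probs, threshold=0.8):
    """probs: (H, W, C) softmax map -> (hard labels, mask of confident pixels)."""
    labels = probs.argmax(axis=-1)          # per-pixel predicted class
    confidence = probs.max(axis=-1)         # per-pixel max probability
    return labels, confidence >= threshold  # only confident pixels are used

probs = np.array([[[0.9, 0.1], [0.55, 0.45]],
                  [[0.2, 0.8], [0.5, 0.5]]])
labels, mask = pseudo_label(probs)
print(labels.tolist(), mask.tolist())
```

During semi-supervised training, the segmentation loss on unlabeled images would be computed against `labels` but masked by `mask`, so uncertain pixels do not reinforce their own errors.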
Embedding contrastive unsupervised features to cluster in- and out-of-distribution noise in corrupted image datasets
Using search engines for web image retrieval is a tempting alternative to
manual curation when creating an image dataset, but their main drawback remains
the proportion of incorrect (noisy) samples retrieved. These noisy samples have
been evidenced by previous works to be a mixture of in-distribution (ID)
samples, assigned to the incorrect category but presenting similar visual
semantics to other classes in the dataset, and out-of-distribution (OOD)
images, which share no semantic correlation with any category from the dataset.
The latter are, in practice, the dominant type of noisy images retrieved. To
tackle this noise duality, we propose a two-stage algorithm starting with a
detection step where we use unsupervised contrastive feature learning to
represent images in a feature space. We find that the alignment and uniformity
principles of contrastive learning allow OOD samples to be linearly separated
from ID samples on the unit hypersphere. We then spectrally embed the
unsupervised representations using a fixed neighborhood size and apply an
outlier sensitive clustering at the class level to detect the clean and OOD
clusters as well as ID noisy outliers. We finally train a noise robust neural
network that corrects ID noise to the correct category and utilizes OOD samples
in a guided contrastive objective, clustering them to improve low-level
features. Our algorithm improves the state-of-the-art results on synthetic
noise image datasets as well as real-world web-crawled data. Our work is fully
reproducible: github.com/PaulAlbert31/SNCF.
Comment: Accepted at ECCV 202
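The alignment/uniformity intuition behind the OOD detection step can be seen in a toy example: after contrastive training, in-distribution features of a class concentrate around a direction on the unit hypersphere, while OOD features spread out, so a simple cosine-similarity rule already separates them. The 2-D synthetic features below are illustrative; the paper operates on learned representations and uses spectral embedding plus outlier-sensitive clustering rather than a raw threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
center = np.array([1.0, 0.0])
id_feats = center + 0.1 * rng.standard_normal((50, 2))  # aligned ID cluster
ood_feats = rng.standard_normal((50, 2))                # spread-out OOD

def norm(z):
    # Project features onto the unit hypersphere, as contrastive losses do.
    return z / np.linalg.norm(z, axis=1, keepdims=True)

id_feats, ood_feats = norm(id_feats), norm(ood_feats)
centroid = norm(id_feats.mean(axis=0, keepdims=True))[0]

sim_id = id_feats @ centroid    # cosine similarity to the ID centroid
sim_ood = ood_feats @ centroid
print(sim_id.mean() > sim_ood.mean())
```

The gap between the two similarity distributions is what makes ID and OOD samples linearly separable on the hypersphere, which the paper's spectral step then exploits at the class level.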
Joint one-sided synthetic unpaired image translation and segmentation for colorectal cancer prevention
Deep learning has shown excellent performance in analysing medical images.
However, datasets are difficult to obtain due to privacy issues, standardization
problems, and lack of annotations. We address these problems by producing
realistic synthetic images using a combination of 3D technologies and
generative adversarial networks. We propose CUT-seg, a joint training where a
segmentation model and a generative model are jointly trained to produce
realistic images while learning to segment polyps. We take advantage of recent
one-sided translation models because they use significantly less memory,
allowing us to add a segmentation model in the training loop. CUT-seg performs
better, is computationally less expensive, and requires fewer real images than
other memory-intensive image translation approaches that require two-stage
training. Promising results are achieved on five real polyp segmentation
datasets using only one real image and zero real annotations. As a part of this
study we release Synth-Colon, an entirely synthetic dataset that includes 20000
realistic colon images and additional details about depth and 3D geometry:
https://enric1994.github.io/synth-colon
Comment: arXiv admin note: substantial text overlap with arXiv:2202.0868
Reliable Label Bootstrapping for Semi-Supervised Learning
Reducing the amount of labels required to train convolutional neural networks
without performance degradation is key to effectively reduce human annotation
efforts. We propose Reliable Label Bootstrapping (ReLaB), an unsupervised
preprocessing algorithm which improves the performance of semi-supervised
algorithms in extremely low supervision settings. Given a dataset with few
labeled samples, we first learn meaningful self-supervised, latent features for
the data. Second, a label propagation algorithm propagates the known labels on
the unsupervised features, effectively labeling the full dataset in an
automatic fashion. Third, we select a subset of correctly labeled (reliable)
samples using a label noise detection algorithm. Finally, we train a
semi-supervised algorithm on the extended subset. We show that the selection of
the network architecture and the self-supervised algorithm are important
factors to achieve successful label propagation and demonstrate that ReLaB
substantially improves semi-supervised learning in scenarios of very limited
supervision on CIFAR-10, CIFAR-100 and mini-ImageNet. We reach low average
error rates with 1 random labeled sample per class on CIFAR-10 and lower this
error further when the labeled sample in each class is highly representative.
Our work is fully reproducible:
https://github.com/PaulAlbert31/ReLaB.
Comment: 10 pages, 3 figure
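The label-propagation step in a ReLaB-style pipeline spreads the few known labels to unlabeled points through proximity in the self-supervised feature space. A single-pass 1-nearest-neighbor version over hand-made 1-D features is sketched below for intuition; the paper uses a graph-based diffusion over learned features, followed by a noise-detection filter this sketch omits.

```python
import numpy as np

feats = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])  # toy 1-D features
labels = {0: 'a', 3: 'b'}  # only two samples carry known labels

def propagate(feats, labels):
    """Assign every sample the label of its nearest labeled sample."""
    known = sorted(labels)
    out = {}
    for i in range(len(feats)):
        if i in labels:
            out[i] = labels[i]
            continue
        nearest = min(known, key=lambda j: abs(feats[i, 0] - feats[j, 0]))
        out[i] = labels[nearest]
    return out

print(propagate(feats, labels))
```

Because propagation quality depends entirely on how well the features group same-class samples, the abstract's point that the self-supervised algorithm and architecture matter follows directly: bad features put unlabeled points near the wrong seed.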
Towards Robust Learning with Different Label Noise Distributions
Noisy labels are an unavoidable consequence of labeling processes and
detecting them is an important step towards preventing performance degradations
in Convolutional Neural Networks. Discarding noisy labels avoids a harmful
memorization, while the associated image content can still be exploited in a
semi-supervised learning (SSL) setup. Clean samples are usually identified
using the small loss trick, i.e. they exhibit a low loss. However, we show that
different noise distributions make the application of this trick less
straightforward and propose to continuously relabel all images to reveal a
discriminative loss against multiple distributions. SSL is then applied twice,
once to improve the clean-noisy detection and again for training the final
model. We design an experimental setup based on ImageNet32/64 for better
understanding the consequences of representation learning with differing label
noise distributions and find that non-uniform out-of-distribution noise better
resembles real-world noise and that in most cases intermediate features are not
affected by label noise corruption. Experiments in CIFAR-10/100, ImageNet32/64
and WebVision (real-world noise) demonstrate that the proposed label noise
Distribution Robust Pseudo-Labeling (DRPL) approach gives substantial
improvements over recent state-of-the-art. Code is available at
https://git.io/JJ0PV
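The "small loss trick" the abstract critiques is simple to state: after a warm-up phase, samples whose training loss is low are presumed clean, since networks fit clean labels before memorizing noisy ones. A fixed-ratio version over synthetic per-sample losses is sketched below; the ratio is an illustrative assumption, and the paper's contribution is precisely that this selection breaks down under some noise distributions, motivating its continuous-relabeling alternative.

```python
import numpy as np

def small_loss_select(losses, clean_ratio=0.6):
    """Return indices of the lowest-loss samples, presumed clean."""
    k = int(len(losses) * clean_ratio)
    order = np.argsort(losses)   # ascending loss
    return np.sort(order[:k])    # keep the k smallest-loss indices

losses = np.array([0.1, 2.3, 0.2, 1.9, 0.15])
print(small_loss_select(losses).tolist())  # -> [0, 2, 4]
```

The retained indices would feed the labeled branch of the subsequent semi-supervised step, while the discarded samples contribute only their image content.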
Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning
Semi-supervised learning, i.e. jointly learning from labeled and unlabeled
samples, is an active research topic due to its key role on relaxing human
supervision. In the context of image classification, recent advances to learn
from unlabeled samples are mainly focused on consistency regularization methods
that encourage invariant predictions for different perturbations of unlabeled
samples. We, conversely, propose to learn from unlabeled data by generating
soft pseudo-labels using the network predictions. We show that a naive
pseudo-labeling overfits to incorrect pseudo-labels due to the so-called
confirmation bias and demonstrate that mixup augmentation and setting a minimum
number of labeled samples per mini-batch are effective regularization
techniques for reducing it. The proposed approach achieves state-of-the-art
results in CIFAR-10/100, SVHN, and Mini-ImageNet despite being much simpler
than other methods. These results demonstrate that pseudo-labeling alone can
outperform consistency regularization methods, while previous work suggested
the opposite. Source code is available at https://git.io/fjQsC
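Mixup, one of the two regularizers the abstract names against confirmation bias, combines pairs of inputs and their (soft) labels with a Beta-distributed weight, so the network never trains on a pure, possibly wrong, pseudo-label. A minimal sketch with toy vectors follows; `alpha` and the inputs are illustrative choices, not the paper's settings.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Convexly combine two samples and their soft labels."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)  # mixing weight in [0, 1]
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x, y = mixup(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
             np.array([0.0, 1.0]), np.array([0.0, 1.0]))
print(x.tolist(), y.tolist())
```

Because the mixed target is itself soft, overconfident incorrect pseudo-labels are diluted on every step, which, together with a minimum number of true labeled samples per mini-batch, is what the paper reports keeps naive pseudo-labeling from overfitting its own mistakes.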